Which Who are They? People Attribute Extraction and Disambiguation in Web Search Results
Abstract
A search for a person's name often returns many Web pages that contain the name string. Because different people share the same name, extracting attributes of the target person (such as birthday, occupation, affiliation, nationality, and contact information) is expected to help differentiate documents about different people and thus group documents about the same person. This paper presents a methodology for two tasks in Web people disambiguation: target person Attribute Extraction (AE) and people clustering. Specifically, we address three questions: (1) How can target person attribute information be extracted effectively from raw Web pages? (2) Do extracted attributes lead to better Web people clustering performance than the raw Web pages do? (3) Which matters more for Web people clustering, the feature representation or the clustering algorithm? To answer them, we first present an effective method for extracting different types of target person attributes from raw Web pages, using deep Web page cleaning and processing pipelines that combine multiple techniques, including traditional named entity recognition (NER), regular expression patterns, and gazetteer-based matching. We then explore Web people clustering from two aspects: feature representation (tokens from the raw Web page, and extracted attribute information) and clustering strategy. The comparative experimental results show that deep Web page cleaning contributes significantly to performance on the target person attribute extraction task. For the people clustering task, the clustering algorithm contributes more to performance improvement than the feature representation does.
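As a rough illustration of how page cleaning, regular expression patterns, and gazetteer-based matching can be combined for attribute extraction, the following minimal Python sketch may help. It is not the system described in the abstract: the function names, the patterns, and the four-word occupation gazetteer are all hypothetical, and the NER step is omitted.

```python
import re

# Hypothetical toy gazetteer; a real system would load a much larger list.
OCCUPATIONS = {"professor", "engineer", "lawyer", "journalist"}

# Illustrative patterns only; these are assumptions, not the paper's patterns.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BIRTH_RE = re.compile(
    r"\bborn (?:on |in )?((?:\w+ \d{1,2}, )?\d{4})", re.IGNORECASE
)
TAG_RE = re.compile(r"<[^>]+>")


def clean_page(html):
    """Very rough page cleaning: drop markup tags and collapse whitespace."""
    text = TAG_RE.sub(" ", html)
    return " ".join(text.split())


def extract_attributes(html):
    """Return a dict of candidate attributes found in one Web page."""
    text = clean_page(html)
    # Lowercased word set used for the gazetteer lookup.
    tokens = {t.lower() for t in re.findall(r"[A-Za-z]+", text)}
    birth = BIRTH_RE.search(text)
    return {
        "email": EMAIL_RE.findall(text),
        "birth_date": birth.group(1) if birth else None,
        "occupation": sorted(tokens & OCCUPATIONS),
    }
```

Calling `extract_attributes` on each search result yields one attribute dictionary per page. Such dictionaries could then serve as an alternative feature representation to raw page tokens when clustering, which is the comparison the abstract describes.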
Related resources
Automatic Annotation of Ambiguous Personal Names on the Web
Personal name disambiguation is an important task in social network extraction, evaluation and integration of ontologies, information retrieval, cross-document co-reference resolution and word sense disambiguation. We propose an unsupervised method to automatically annotate people with ambiguous names on the web using automatically extracted keywords. Given an ambiguous personal name, first, we...
WePS-3 Evaluation Campaign: Overview of the Web People Search Clustering and Attribute Extraction Tasks
The third WePS (Web People Search) Evaluation campaign took place in 2009-2010 and attracted the participation of 13 research groups from Europe, Asia and North America. Given the top web search results for a person name, two tasks were addressed: a clustering task, which consists of grouping together web pages referring to the same person, and an extraction task, which consists of extracting s...
Person Attribute Extraction from the Textual Parts of Web Pages
We present the RGAI systems which participated in the third Web People Search Task challenge. The chief characteristics of our approach are that we focus on the raw textual parts of the Web pages instead of the structured parts, we group similar attribute classes together and we explicitly handle their interdependencies. The RGAI systems achieved top results on the attribute extraction subtask,...
WIT: Web People Search Disambiguation using Random Walks
In this paper, we describe our work on a random walks-based approach to disambiguating people in web search results, and the implementation of a system that supports such an approach, which we used to participate in the Semeval’07 Web People Search task.
Exploiting Web querying for Web People Search in WePS2
Searching for people on the Web is one of the most common query types to the web search engines today. However, when a person name is queried, the returned result often contains webpages related to several distinct namesakes who have the queried name. The task of disambiguating and finding the webpages related to the specific person of interest is left to the user. Many Web People Search (WePS)...